{ "cells": [ { "cell_type": "markdown", "id": "56362576", "metadata": {}, "source": [ "# Homework 2\n", "\n", "In this homework we'll explore decision trees and overfitting, and learn about the right way to evaluate the performance of a classifier." ] }, { "cell_type": "code", "execution_count": 114, "id": "d5cb5b7d", "metadata": {}, "outputs": [], "source": [ "from sklearn import datasets\n", "from sklearn import tree\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import accuracy_score\n", "import numpy as np\n", "import random\n", "from sklearn.tree import export_text\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": 115, "id": "4f705984", "metadata": {}, "outputs": [], "source": [ "def make_dataset(n, d = 4, p = 0):\n", " \"\"\"\n", " Create a dataset with boolean features and a binary class label.\n", " The label is assigned as x1 ^ x2 V x3 ^ x4.\n", " \n", " Arguments:\n", " n - The number of instances to generate\n", " m - The number of features per instance. Any features beyond the first four\n", " are irrelevant to determining the class label.\n", " p - The probability that the true class label as computed by the expression\n", " above is flipped. Said differently, this is the probability of class noise.\n", " \"\"\"\n", " \n", " assert d >= 4, 'The dataset must have at least 4 features'\n", " X = [np.random.randint(2, size = d) for _ in range(n)]\n", " y = [(x[0] and x[1]) or (x[2] and x[3]) for x in X]\n", " y = [v if random.random() >= p else (v + 1) % 2 for v in y]\n", " return X, y" ] }, { "cell_type": "markdown", "id": "e655d42d", "metadata": {}, "source": [ "When evaluating the accuracy of a classifier, the right way to do it is to have a test set of instances that were not used to train the classifier and measure on those instances. 
The [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function in scikit-learn makes it easy to create training and testing sets. Below is an example that shows overfitting, as evidenced by higher accuracy on the training set than on the testing set." ] }, { "cell_type": "code", "execution_count": 116, "id": "c9296ffb", "metadata": {}, "outputs": [], "source": [ "# Create a dataset with 1000 instances, each with 10 attributes, and 10% class noise\n", "X, y = make_dataset(1000, d = 10, p = 0.1)" ] }, { "cell_type": "code", "execution_count": 117, "id": "68be6e50", "metadata": {}, "outputs": [], "source": [ "# Make training and testing sets, each with half of the data\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, train_size=0.5)" ] }, { "cell_type": "code", "execution_count": 118, "id": "c9c91fa8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Training accuracy: 0.96\n", "Testing accuracy: 0.78\n" ] } ], "source": [ "# Train the classifier and evaluate it on train/test splits\n", "clf = tree.DecisionTreeClassifier()\n", "clf.fit(X_train, y_train)\n", "print('Training accuracy: %.2f' % accuracy_score(y_train, clf.predict(X_train)))\n", "print('Testing accuracy: %.2f' % accuracy_score(y_test, clf.predict(X_test)))" ] }, { "cell_type": "markdown", "id": "4b80c59b", "metadata": {}, "source": [ "Note that if the training set has 0% class noise, we get a perfect tree. Spend some time convincing yourself that the tree below captures the boolean expression that assigns class labels." 
] }, { "cell_type": "code", "execution_count": 119, "id": "119f8f27", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "|--- feature_3 <= 0.50\n", "| |--- feature_1 <= 0.50\n", "| | |--- class: 0\n", "| |--- feature_1 > 0.50\n", "| | |--- feature_0 <= 0.50\n", "| | | |--- class: 0\n", "| | |--- feature_0 > 0.50\n", "| | | |--- class: 1\n", "|--- feature_3 > 0.50\n", "| |--- feature_2 <= 0.50\n", "| | |--- feature_0 <= 0.50\n", "| | | |--- class: 0\n", "| | |--- feature_0 > 0.50\n", "| | | |--- feature_1 <= 0.50\n", "| | | | |--- class: 0\n", "| | | |--- feature_1 > 0.50\n", "| | | | |--- class: 1\n", "| |--- feature_2 > 0.50\n", "| | |--- class: 1\n", "\n" ] } ], "source": [ "X, y = make_dataset(1000, d = 10, p = 0.0)\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, train_size=0.5)\n", "clf = tree.DecisionTreeClassifier()\n", "clf.fit(X_train, y_train)\n", "print(export_text(clf))" ] }, { "cell_type": "markdown", "id": "f89da077", "metadata": {}, "source": [ "# Assignment\n", "\n", "Explore the impact of the following on the extent of overfitting:\n", "* The size of the dataset (n in the call to make_dataset)\n", "* The number of irrelevant features (d in the call to make_dataset)\n", "* The probability of class noise (p in the call to make_dataset)\n", "* The minimum number of samples required for a node to be split. That is the min_samples_split parameter to the [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) constructor\n", "\n", "
\n",
" \n",
"Vary each of the parameters above and build learning curves for training and testing accuracy, plot them, and for each of the parameters write up an explanation for the impact the parameter has on overfitting. Also, in each case, display at least one decision tree and explain what is happening that is making it overfit."
]
},
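{
"cell_type": "markdown",
"id": "a3f1c9d2",
"metadata": {},
"source": [
"As a starting point, here is a sketch of the kind of comparison the assignment asks for: the same dataset fit with and without a floor on min_samples_split. The values 2 and 50 below are illustrative choices, not required settings. A larger floor keeps the tree from splitting tiny, noise-driven groups of training instances, which usually shrinks the gap between training and testing accuracy."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7e2d4f8",
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: compare an unconstrained tree (min_samples_split=2,\n",
"# the default) to one with a larger minimum split size. The values used\n",
"# here are example choices only.\n",
"X, y = make_dataset(1000, d = 10, p = 0.1)\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, train_size=0.5)\n",
"for min_split in [2, 50]:\n",
"    clf = tree.DecisionTreeClassifier(min_samples_split=min_split)\n",
"    clf.fit(X_train, y_train)\n",
"    print('min_samples_split = %d: train %.2f, test %.2f' %\n",
"          (min_split,\n",
"           accuracy_score(y_train, clf.predict(X_train)),\n",
"           accuracy_score(y_test, clf.predict(X_test))))"
]
},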
{
"cell_type": "markdown",
"id": "2c30ebe7",
"metadata": {},
"source": [
"Here is an example of generating a learning curve for a fixed size dataset where the fraction of instances used for training is varied. You can use this template to create your own learning curves."
]
},
{
"cell_type": "code",
"execution_count": 122,
"id": "606a3a3e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"